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Abstract. The Minimum-Recombinant Haplotype Configuration 
problem (MRHC) has been highly successful in providing a sound com- 
binatorial formulation for the important problem of genotype phasing on 
" pedigrees. Despite several algorithmic advances and refinements that led 

^ to some efficient algorithms, its applicability to real datasets has been 

^^ limited by the absence of some important characteristics of these data in 

^^ its formulation, such as mutations, genotyping errors, and missing data. 

^^ In this work, we propose the Haplotype Configuration with Re- 

combinations AND Errors problem (HCRE), which generalizes the 
original MRHC formulation by incorporating the two most common 
characteristics of real data: errors and missing genotypes (including un- 
"~^ typed individuals). Although HCRE is computationally hard, we propose 

^ an exact algorithm for the problem based on a reduction to the well- 

known Satisfiability problem. Our reduction exploits recent progresses 
in the constraint programming literature and, combined with the use of 
state-of-the-art SAT solvers, provides a practical solution for the HCRE 
problem. Biological soundness of the phasing model and effectiveness 
r ■ (on both accuracy and performance) of the algorithm are experimen- 

tally demonstrated under several simulated scenarios and on a real dairy 
cattle population. 



1 Introduction 



C^ After the first draft of the human genome was published in 2000, a huge re- 

search effort has been devoted to the discovery of genetic differences among 
same-species individuals and to the characterization of their impact to the ex- 
pression of different phenotypic traits such as disease susceptibility or drug re- 
sistance. Moreover, the completion of several livestock genomes and the recent 
introduction of genomic data into breeding programs widened the interest of 
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these studies to other species. Most of these efforts are driven by the Interna- 



tional HapMap Project 11 , which discovered, investigated and characterized 
miUions of genomic positions (caUed loci or sites) where different individuals 
carry different genetic subsequences (caUed alleles). In practice, unordered pairs 
of alleles coming from both parents of each individual studied are routinely col- 
lected, since determining the parental source of each allele is too time-consuming 
and expensive to be performed on large studies jS] . The pairs of alleles located 
at a given set of loci of an individual are called the (multi-locus) genotype of the 
individual, while the sequence of alleles that were inherited from a single parent 
is called a haplotype. The advance of high-throughput and high-density genotyp- 
ing technologies, combined with a consistent reduction of genotyping costs, has 
led to a great abundance of genotypic data. Such genotypes (also called SNP 
genotypes) are generally biallelic (i.e., at each locus only two distinct alleles are 
observed in the population) and they will be the focus of this work. A number 
of association studies based on SNP genotypes have been carried out but, since 
haplotypes substantially increase the power of genetic variation studies jl3| , ac- 
curate and efficient computational prediction of haplotypes from genotypes is 
highly desirable. Mendelian inheritance laws, which govern the transmission of 
genetic material from parents to children, have been effectively used to improve 
the accuracy of haplotyping methods. The general problem that we have just 
described is called in the literature Haplotype Inference (HI) on pedigrees, and 
it asks for a haplotype configuration consistent with a given genotyped pedigree. 
While the problem has been successfully tackled by classic statistical methods 
(such as Lander-Green fsl and Elson-Stewart (i]), the increasing density and 
length of SNP genotypes pose new hurdles. In fact those methods are not de- 
signed to scale well on large datasets and they do not take directly into account 
the presence of Linkage Disequilibrium among loci in the founder population. 

Combinatorial formulations have been proposed to overcome such limita- 
tions. Since there can exist an exponential number of consistent haplotype con- 
figurations, we need an additional criterion for choosing the "right" haplotype 
configuration among those compatible with the input data. The genetic linkage 
between neighbouring loci has inspired a parsimonious formulation of the HI 
problem that looks for a configuration minimizing the number of recombination 
events in the resulting haplotyped pedigree. This formulation, called MiNlMUM- 
Recombinant Haplotype Configuration (MRHC) 6,9 , has been shown 
to be a successful approach. The aim of this formulation is the computation of 
a haplotype configuration which is consistent with an input genotyped pedigree 
and induces the minimum number of recombinations. The formulation naturally 
arises since recombinations are the most common source of genetic variation. 
However, the HI problem where mutations were allowed instead of recombina- 
tions has been solved by an ILP-based algorithm 1 14 . An efficient heuristic when 
both recombinations and mutations are allowed has been presented in [8]. 

Despite several remarkable advances in genotyping technologies, genotypes 
are usually affected by a small percentage of errors and missing data which, even 
at very small rates (< 0.5%), could heavily affect the outcome of subsequent anal- 
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yses. Moreover, missing genotypes could represent a significant portion (5% or 
more) of tlie dataset due to uncertainty in genotype call procedures or unavail- 
ability of the DNA sample of some individuals of the pedigree. These aspects 
shared by almost all actual datasets were not considered in the original MRHC 
formulation, limiting its applicability and practical relevance. 

In this paper, we propose the Haplotype Configuration with Recom- 
binations AND Errors (HCRE) problem, which generalizes the MRHC for- 
mulation by allowing genotyping errors and missing data in addition to recombi- 
nation events. Polynomial-time exact algorithms for HCRE are unlikely to exist 
since the problem is APX-hard even on simple instances. The main contribution 
of this paper is a practical exact algorithm for HCRE, based on a reduction from 
HCRE to the well-known Satisfiability problem (SAT), for which extremely ef- 
ficient solvers are known. Our reduction exploits some characteristics of one of 
the best performing SAT solvers, CryptoMiniSat [lO], in order to compute a 
solution for most of the practical instances with modest computing resources. 
An extensive experimental evaluation of our algorithm under several simulated 
scenarios and on a real dairy cattle population demonstrates its accuracy and 
performance. 

2 The Computational Problem 

In this section we define the basic notions that will be studied in the rest of 
the work. A pedigree graph is an oriented acyclic graph P = {V, E) such that 
(i) vertices correspond to individuals and are partitioned into male and female 
vertices (i.e., V = M\JF,M(^F = 0), {ii) each vertex has indegree or 2, 
and {in) if a vertex has indegree 2, then one edge must come from a male node 
and the other from a female node. If a vertex v has indegree 0, then v is called 
founder. For each edge (p, c) € E, we say that p is a parent of c and c is a child of 
p. More precisely, p is the father (mother, resp.) of c if p is male (female, resp.). 
The (possibly incomplete) genotype of an individual i is a n-long vector gi 
over the set {0, 1, 2, *} where we follow the convention of encoding the unordered 
pair of alleles {0, 0} as 0, {1, 1} as 1, and {0, 1} as 2, while * represents a missing 
(or "not called") genotype. A genotype is complete if it does not contain the * 
element, otherwise is incomplete. An individual c is said to be heterozygous in a 
given locus i if gc[i] = 2, and homozygous if gc[i] G {0, 1}. A haplotype configu- 
ration H is an assignment of a pair of haplotypes {h^,hj) to each individual i 
and a pair of source vectors {sf^i,Sm,i) to each non- founder individual i of the 
pedigree (where / and m are the parents of i). Both a haplotype and an source 
vector are binary n-long vectors. In a haplotype and 1 are the two (major and 
minor) alleles. Informally, the source vector Sp^i associated with a haplotype /i^, 
where p is a parent of i, indicates if the allele h\[l] has been inherited from the 
paternal (sp,i['] = 0) or maternal (sp,i[Z] = 1) haplotype of p. More formally, let 
hi = {hf, h\) be the haplotypes of a non-founder individual i. Then hi is consis- 
tent with the Mendelian laws of inheritance for a locus I if (a) h^[l] — hV [I] 
where / is the father of i, and (6) h\[l] — h^m' [I] where m is the mother of i. 
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Let hi = (/i?, h\) be the haplotypes of an individual i with genotype gi. Then hi 
is consistent with gi in a locus Z if h^[l] = hl[l] = gi[^] if ' is homozygous, and 
{/i°[Z], hjll]} = {0, 1} if Z is heterozygous. A haplotype h^ of individual i inherited 
from its parent p contains a recombination at locus / > if Sp^i [I — 1] y^ Spj [I] . 

In this work, we study the (r, e)-HC problem, which formalizes the HI prob- 
lem on pedigrees allowing the presence of recombinations and genotype inconsis- 
tencies. If r or e are *, then the corresponding values are intended as unbounded. 
The (r, e)-HC problem generalizes the Minimum- Recombinant Haplotype 
Configuration (MRHC) problem f^, where only recombinations are allowed 
and all the genotypes are assumed to be complete and correctly called. 

Problem 1. (r, e)-HAPLOTYPE CONFIGURATION PROBLEM ((r, e)-HC). 

Input: A genotyped pedigree P with possibly incomplete genotypes (i.e., gi[l] — * 

for some individual i and locus I). 

Output: A haplotype configuration H for the genotyped pedigree P such that: 

1. _ff is consistent with the Mendelian laws of inheritance for each individual i 
and locus l\ 

2. H is consistent with the observed genotypes in all but at most e cases; 

3. H contains at most r recombinations. 

Notice that, while we motivated our problem with a parsimonious principle, 
we preferred not to formulate (r, e)-HC as an optimization problem. In fact, 
any cost function that melds the number of recombinations and errors (i.e., by 
applying some weights to them) would potentially introduce some bias in the 
computed solution. Instead, the two parameters r and e directly map to two 
important characteristics of the dataset, the genetic distance among the markers 
and the quality of the collected genotypes, for which the researcher could provide 
good estimates. As an example, the recombination rate between adjacent markers 
in high-density panels is about 10~^ while the genotyping error rate (after quality 
check filters) is generally less than 0.5%. 

3 Reducing (r, e)-HC to SAT 

A main technical device of this work consists of a reduction from (r, e)-HC to 
SAT. In this work, an instance of SAT is a set of "extended clauses" , where each 
extended clause is either the disjunction or the exclusive-OR of literals (i.e., vari- 
ables or their negation) . The instance is satisfiable if and only if all the extended 
clauses are satisfiable. The main reason for choosing such a generalization is that 
our reduction is slightly simplified by its use since XORs of literals are additions 



over the Z2 field. Moreover the SAT solver that we selected, CryptoMiniSat 10 
is designed to solve those instances. For simplicity, we slightly abuse the lan- 
guage by indicating also the exclusive-OR of literals with the term "clause" . In 
the following we denote with the exclusive-OR, with = the equivalence, with 
A the conjunction, with V the disjunction, and with -^ the negation. 

In order to gently guide the reader, first we present the reduction for the case 
of unbounded number of recombinations and errors. Then, we will deal with the 
general problem (r, e)-HC. In our reduction we use some Boolean variables: 
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Pi [I] , mi [I] , alleles of the paternal and maternal haplotypes of individual i at 
locus I (i.e., h^[l] and hl[l]) where false is equal to and true is equal to 1; 

Sp.i[Z], which is true if the source vector of individual i (w.r.t. parent p) at 
locus I is equal to 1, and false if it is equal to 0; 

rp_i[Z], which is true if a recombination has occurred between individual i 
and its parent p at locus I (i.e., if Sp_i[l — 1] ^ Sp^i[l]); 

ei [I] , which is true iff the haplotype configuration is not consistent with the 
observed genotypes for individual i at locus /. 



To better describe our model and the associated SAT instance, we will put 
on the left the constraint we are imposing and on the right the correspond- 
ing clauses, preceded by the condition under which the constraint must hold. A 
(*, *)-HC instance can be encoded by the logic formula consisting of the conjunc- 
tion of the following clauses which encodes three kinds of constraints: constraints 
for Mendelian laws consistency, constraints for genotype consistency, and con- 
straints for recombination representation. The first class of constraints ensures 
the consistency with the Mendelian laws of inheritance between a non-founder 
individual i and its parent p. In other words, each allele of the haplotype of i 
inherited from p must be equal to one allele of one of the haplotypes of p de- 
pending on the variable Sp^i. We abstract from the gender of p and use Cp^i[l] to 
indicate the variable pi[l] Up is father of i and mi[l] if p is mother of i. 



For each individual i, each parent p of 


s, and each locus I: 








'sp,»[^]VppHV-Cp,4/] 




{iPp[l]A^SpM® 

® (mp[l]ASp^^[l])) =Cp^,[Z] 


< 


spAi]'^^Pp[i]'^cpAi] 

^Sp4l]Wmp[l]W^Cp4l] 
-^Sp4l]W ^mp[l]W Cp^l] 
^Pp[l]W^mp[l]W Cp^l] 
^Pp[l]W mp[l]W ^Cp^l] 


(3.1) 



Notice that the last two clauses of (3.11 are implied by the other clauses. 



However, their explicit inclusion in our formulation improves the propagating 
behaviour and the overall efficiency of the SAT solver p] . 

The second class of constraints ensures that the computed haplotypes are 
either consistent with the observed genotypes or that variable ei[l] is true. This 
leads to three different equations depending on the observed genotype (clearly 
no constraint is needed for an individual i and locus I where the genotype gi[l] 
is missing). 



For each individual i and locus I such that gi [^] = : 


e,[l]^iMl]^rn,[l]^0) 


(^eS]VpS]-^m,[l] 
lei[l]V^p.,[l] 
[ei[l] V^mJ/] 


(3.2) 
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For each individual i and locus I such that gi [I] = 1 



e^[l] y^ (Ml] = rn,[l] = 1) 



e,[l]Vp^[l] 



(3.3) 



For each individual i and locus I such that gi [I] — 2 



ei[l] = {pi[l] = mi[/]) 



[/]®K[Z]®m4/] (3.4) 



The last class of constraints ensures that r. 



p.n 



is true when there is a recom- 



bination between individual i and its parent p at locus / (i.e., Sp^i[l — 1] ^ Sp^JZ]). 


For each individual i, each parent p of i, and each locus Z > 1: 


rp,[l]-i^pAl-^]^^pM 


-rp,,[;]©Sp,4/-l]©Sp,,W (3.5) 



The conjunction of all previous clauses is an instance of (*,*)-HC. Notice 
that the total number of variables and clauses is 0{nm), where n is the pedigree 
size and m the length of the genotype. 

To bound the total number of errors and recombination, we simply have to 
add two cardinality constraints: 



E 



< e 



E 



' P,«L 



< r 



(3.6) 



individual i 


individual i 


locus / 


parent p of i 




locus / 



Converting a cardinality constraint X)i<i<n Wj < fc in a compact Boolean for- 
mula is not straightforward since a naive approach would consider all the subsets 
of k elements and, thus, would produce an exponential number of clauses, ruling 
out even moderate instances. This problem of devising an "efhcient" encoding 
of a cardinality constraint into a CNF formula has been investigated in the 
constraint programming literature and a promising approach has been recently 
proposed. This technique is based on the construction of cardinality networks fl] 
and essentially sorts the set {xi, . . . , Xn} of variables composing the constraint 
obtaining a permutation (yi,...,j/„) such that whenever yi is false, then all 
variables yj for j > i are also false. Consequently, bounding the number of re- 
combinations and errors consists of forcing the variables yr+i or ye+i to be false. 
Two advantages of the cardinality network approach are that only 0(nlog^ fc) 
additional clauses are required and that arc-consistency under unit propagation 
is preserved, which improves the performances of modern SAT solvers. We refer 
the reader to [I] for the detailed description of the required clauses.. 



4 Experimental Results 



An implementation of our algorithm, called reHCstar, is available at http : 
[//www. algolab. eu/reHCstar We used our implementation to investigate feasi- 
bility, accuracy and performance of our approach under several scenarios. First, 
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we analyzed the effect of changing the main parameters (such as pedigree size, 
missing rate, etc.) on the accuracy and the performance of our algorithm. Second, 
we compared reHCstar with MePhase ^15, which is a combinatorial approach 
to a slightly different formulation of the Haplotype Inference problem on pedi- 
grees. Third, we present the evaluation of reHCstar on a complex pedigree 
of a real bovine population. We tested on such pedigree the soundness of the 
(r, e)-HC formulation and the ability of reHCstar to handle these instances. 



Evaluation on Random Instances. We evaluated how the main problem 
parameters - pedigree size (n), genotype length (m), recombination probability 
(9), error probability (e), and missing probability (/i) - affect the accuracy and 
the performance of reHCstar. For each choice of the parameters, we generated 
10 different random pedigree graphs. Then we generated 10 random haplotype 
configurations for each pedigree as follows. Two haplotypes have been randomly 
generated for each founder of the pedigree. Haplotypes of non-founder individu- 
als have been uniformly sampled from those of their parents and a recombination 
has been applied at each locus with probability 9. Genotypes of the individuals 
are then computed from their haplotypes. Finally, each locus of each individual 
genotype has been replaced with a different pair of alleles with probability e 
and "masked" with probability /i (to simulate missing genotypes). Accuracy of 
the results computed by reHCstar on each instance has been evaluated with 
respect to the original haplotype configuration according to genotype imputa- 
tion error rate (the fraction of the missing genotypes that have been incorrectly 
imputed) and haplotyping error rate (the fraction of alleles that have been incor- 
rectly predicted). The latter ratio has been computed for the set of all loci and 
for the set of non-missing genotypes. Performance has been evaluated consider- 
ing the average and the maximum running time of reHCstar on a standard 
workstation with a 2.8GHz CPU. We have chosen a base set of reasonable val- 
ues for the parameters: n = 50, m = 50, 9 = 0.005, e = 0.005, and ^ = 0.05. 
We performed five series of tests; in each of these series only one parameter is 
allowed to assume different values, while the other four parameters are fixed 
to the base value. Moreover we fixed the values of r and e on each instance as 
constant proportions r^ and Cr of the total number of recombination and error 
variables, respectively. The base values of the maximum recombination rate r^ 
and the maximum error rate Cr is 0.012. Since a random generator can produce 
instances where the actual error or recombination rate is larger than 9 and e, 
the actual values of r and e given to reHCstar must be chosen appropriately. 
Table [T] summarizes the accuracy and the performance of reHCstar on two 
series that are representative of this experimental part. Due to space constraints, 
we presented the complete table in the supplementary material (Tab. A.l) and 
we only sketch the main conclusions. We notice that the average genotype im- 
putation error rate is always below 21%, the average haplotyping error rate is 
always below 8% and the average running time is always below 9 minutes, while 
the instance that took the longest time terminated in 52 minutes. The best case 
for each row is boldfaced in order to show the effects that parameter variations 
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Table 1. Summary of accuracy and performance obtained by reHCstar on randomly 
generated instances. Each table refers to a series of tests where only one parameter 
has been varied. The base values of the parameters are: n = 50, m — 50, 6 — 0.005, 
e = 0.005, fi = 0.05. The best result for each row is boldfaced. 

(a) Increasing pedigree size (n) 

Pedigree size n = 50 100 200 

No. of completed instances 100/100 100/100 100/100 

Avg. genotype imputation error rate 

Avg. liaplotyping error rate 

Avg. liaplotyping error rate (wo/missing) 

Avg. running time (in seconds) 



0.161 


0.151 


0.149 


0.043 


0.039 


0.033 


0.037 


0.034 


0.027 


15.1 


82.3 


359.4 


72.3 


479.9 


1129.3 



Max. running time (in seconds) 72.3 


479.9 


1129.3 


(b) Increasing genotype length (m) 


Genotype length m — 50 


100 


200 



No. of completed instances 

Avg. genotype imputation error rate 

Avg. haplotyping error rate 

Avg. haplotyping error rate (wo/missing) 

Avg. running time (in seconds) 

Max. running time (in seconds) 



00/100 


100/100 


100/100 


0.170 


0.174 


0.174 


0.043 


0.049 


0.063 


0.038 


0.045 


0.059 


17.0 


93.6 


505.0 


85.7 


770.4 


3127.5 



has on the outcome. The main noteworthy, albeit predictable, observation is 
that larger pedigrees require more computational resources, but result in a more 
accurate prediction. 

Comparison with a State-of-the-Art Method. In this second part, we 
compared accuracy and performance of reHCstar with another state-of-the- 



art approach for the haplotype inference problem on pedigrees: MePhase 14 . A 
second approach, PedPhase 3.0 fzl, was initially considered. A preliminary test 
revealed that PedPhase terminated without giving a solution or giving an error 
message on a significant subset of instances and we preferred to not include it 
in this comparison. MePhase is a heuristic algorithm for the haplotype configu- 
ration with mutation and errors problem on tree pedigrees (i.e., pedigrees such 
that each pair of individuals are connected by at most one directed path) and it 
is based on an ILP formulation derived from a system of linear equations over 



Z2 15 . This formulation of the HI problem asks for a haplotype configuration of 
a given genotyped pedigree allowing mutations, genotyping errors, and missing 
genotypes, while in this work we allow recombinations, genotyping errors, and 
missing genotypes. As a consequence, the comparison with MePhase involved 
randomly generated haplotype configurations on tree pedigrees (opposed to the 
general pedigrees used in the first part) without mutations and recombinations. 
To ensure a proper comparison even if our approach and MePhase were designed 
for slightly different problems, the experimental evaluation is similar to the one 
described by the authors of MePhase in flil . Tree pedigrees have been generated 
using the random generator proposed by Thomas and Cannings tl2j. Random 



(r, e)-HC via SAT 9 

haplotype configurations have been assigned to the tree pedigrees just as in the 
previous experimental part. In this case, we analyzed the effects of the following 
problem parameters on the accuracy and performance of MePhase and reHC- 
STAR: average nuclear family size (denoted with /, and represents the average 
number of individuals that compose a nuclear family) , genotype length (m) , er- 
ror prohahility (e), and missing prohahility (/i). We generated five different tree 
pedigrees for each value of / ranging from 3 to 6. For each pedigree and for each 
combination of the remaining parameters (r7i,£,/i), six random haplotype con- 
figurations have been computed. We considered the following set of parameter 
values: m e {50,100}, e € {0,0.005,0.01}, and /i G {0,0.05,0.1,0.2} for a total 
of 2880 instances. Table [2] summarizes the results of the comparison. Accuracy 
of the two approaches in imputing the missing genotypes and reconstructing the 
haplotype configuration is similar: reHCstar is definitely more accurate than 
MePhase on smaller families (/ > 4), while MePhase is slightly more accurate 
on larger families (/ > 5). A likely reason is that MePhase computes haplotype 
configurations with fewer genotyping errors than reHCstar on larger families, 
as witnessed by the fact that, on those families, the average number of errors is 
smaller for MePhase than for reHCstar. Also, on those families, precision and 
recall (i.e., the fraction of original and computed errors that have been correctly 
identified) are better for MePhase than for reHCstar. 

However, reHCstar is considerably more efficient than MePhase: the dif- 
ference is most remarkable on large families (/ = 6) where MePhase took 1092 
seconds on average compared to the 23 seconds required by ReHCstar. More- 
over, MePhase was not able to solve 266 of the original 2880 instances (9.2%) 
within a time limit of an hour for each instance, while reHCstar completed 
the whole dataset within the same time limit. MePhase could not solve within 
the time limit 31.2% of the instances with / = 6. Overall, ReHCstar is much 
faster and more scalable than MePhase while maintaining comparable accuracy. 



Evaluation on a Real Genotyped Pedigree. In the last part, we evaluated 
our approach on a real genotyped pedigree. The pedigree describes part of a 
dairy cattle population that has been obtained from the Italian Brown Swiss 
Breeders Association (ANARB). Genotypes have been obtained from a Bovi- 
neSNP50 BeadChip and were restricted to 2570 loci on Chromosome 6. The 
pedigree (represented in the Supplementary Figure S.l) contains 207 individu- 
als - 130 males and 77 females - with 93 founders. Unlike humans pedigrees, 
livestock pedigrees are composed by large families of half-siblings. In fact, the 
pedigree contains two bulls that generated 17 and 15 offspring, respectively, 
while other 12 bulls generated 50 offspring. A second typical characteristic of 
livestock pedigrees is the presence of several individuals that are not genotyped. 
The reason is twofold: the introduction of genotyping technologies for livestock 
species is recent. Therefore, most of the animals of the pedigree are not alive 
and cannot be genotyped. Moreover, the extraction of genotypes is quite expen- 
sive, hence it is performed only on animals of higher commercial value, such as 
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Table 2. Results of the experimental comparison between MePhase [l4] and reHC- 
STAR. The first column (/) indicates the average size of nuclear family, the second 
reports the average number of errors present in the generated haplotype configuration 
(original errors), the next four columns (and the last one) are defined as the rows of 
Table IT] The 7th column indicates the average number of errors present in the recon- 
structed haplotype configuration (computed errors). Precision and recall are defined 
as the proportion of original (computed, resp.) errors that have been correctly iden- 
tified. Since MePhase was not able to solve all the instances within the 1-hour time 
limit, we also computed the measures obtained by reHCstar restricted to the subset 
of instances completed by MePhase. 





no. of 
errors 








MePhase 










/ 


compl. 

instan. 


genot. 

imput. 
error 
rate 


haplot. 
error 
rate 


haplot. 

error 

rate 

wo/miss 


no. of 
comp. 
errors precision 


recall 


avg. 

running 

time 

(sec) 


3 
4 

5 
6 


17.1 
17.1 
16.1 
13.8 


719 
713 

687 
495 


0.234 
0.092 

0.048 
0.022 


0.044 
0.017 
0.008 
0.003 


0.024 
0.007 
0.003 
0.001 




21.1 
17.1 
15.8 
13.6 


0.503 
0.751 
0.854 
0.926 


0.376 
0.627 
0.788 
0.921 


5.8 

16.0 

83.1 

1092.1 


Overall 


16.2 


2614 


0.106 


0.019 


0.009 




17.2 


0.739 


0.630 


234.6 












reHCi 


3TAR 








3 
4 
5 
6 


17.1 
17.1 
16.1 
13.8 


720 
720 
720 
720 


0.123 
0.072 
0.049 
0.024 


0.025 
0.016 
0.011 
0.006 


0.014 
0.009 
0.007 
0.004 




16.8 
21.0 
21.6 
22.0 


0.663 
0.654 
0.629 
0.678 


0.643 
0.745 
0.765 
0.841 


15.2 
19.5 
19.7 
22.6 


Overall 


16.2 


2880 


0.067 


0.015 


0.008 




20.3 


0.657 


0.737 


19.2 








reHCstar (on instances completed also 


by Me 


Phase) 




3 
4 
5 
6 


17.1 
17.1 
16.1 
13.8 


719 
713 
687 
495 


0.123 
0.072 
0.050 
0.024 


0.025 
0.016 
0.011 
0.006 


0.014 
0.009 
0.007 
0.004 




16.8 
20.9 
20.2 
17.8 


0.663 
0.654 
0.627 
0.671 


0.643 
0.744 
0.762 
0.839 


15.2 
19.3 
16.5 
14.8 


Overall 


16.2 


2614 


0.071 


0.015 


0.009 




19.0 


0.653 


0.727 


16.6 



breeding animals. In our pedigree we have the genotypes of only 105 males (and 
no females) . Consequently almost half of all genotypes are missing. 

The aim of this part of the experimentation is twofold: (i) to validate the 
minimum-recombinant and minimum-error model implicitly assumed in the for- 
mulation of the (r, e)-HC problem and, (ii) to empirically prove that ReHCstar 
can handle large, real-world pedigrees with a lot of non-genotyped individuals. 

We prepared 6 instances from the original data by selecting 6 subsets of 50 
original markers spaced at different distances d. The first subset has distance 
d = 1 or, in other words, is composed by the first 50 original markers. The 
second subset has distance d = 2 (i.e., we selected every other marker), the third 
has distance d = 10 (i.e., we selected one marker every ten original subsequent 
markers), and the other three have distance d = 15,20,25, respectively. The 
genotypes used in the experimentation have been obtained by restricting the 
original genotypes to the subsets of selected markers. This process simulates 
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Table 3. Summary of the minimum number of recombinations and errors needed to 
solve the livestock pedigree at 6 different marker distances d — 1,2, 10, 15, 20, 25. The 
first two columns indicate the maximum recombinations and error rates given in input 
to reHCstar. Each other cell has either indicated the number of recombinations (r) 
and errors (e) in the resulting haplotype configuration or "no solution" if no solution 
exists within the given recombination and error rates or "timeout" if no solution has 
been found within the time limit. Blank cells indicate instances already solved with a 
more stringent choice of recombination and error rates. 



Rccomb. 


Error 
rate Cj. 






Marker distances d 






rate Vr 


1 


2 


10 


15 


20 


25 


0.0 


0.0 


no sol. 


no sol. 


no sol. 


no sol. 


no sol. 


no sol. 


0.0005 
0.001 
0.005 


0.0 
0.0 
0.0 


r = 4 


r=5 


no sol. 
no sol. 
r=25 


no sol. 
no sol. 
r=50 


no sol. 
no sol. 
r=52 


no sol. 
no sol. 
r=53 



0.0 


0.0005 


0.0 


0.001 


0.0 


0.005 


0.0 


0.01 


0.0 


0.02 



.o sol. 


no sol. 


no sol. 


no sol. 


no sol. 


e^5 


no sol. 


no sol. 


no sol. 


no sol. 




e^27 


timeout 


no sol. 


timeout 






e^45 


timeout 
e^94 


timeout 

e^89 



0.0005 


0.0005 


r=0, e=l 


r=5, 


e=0 


no sol. 


no sol. 


no sol. 


no sol. 


0.0005 


0.005 








r=6, e=20 


timeout 


timeout 


timeout 


0.0005 


0.01 










r=0. e=45 


timeout 


timeout 


0.0005 


0.02 












r=0, e=94 


r=6, e=81 


0.001 


0.0005 








r=12, e=3 


no sol. 


no sol. 


no sol. 


0.001 


0.001 








r-11, e=6 


timeout 


timeout 


no sol. 


0.005 


0.0005 










r=40, e=3 


r=48, e=3 


r=50, e=l 



genotypes collected at different "densities" . Since the closer the loci, the stronger 
their linkage, we expect that high-density genotypes (the ones with distance d = 
1 and d = 2) can be solved with only a few recombinations and/or errors, while 
low-density genotypes should require a larger amount of recombinations and/or 
errors. To test this hypothesis, we ran reHCstar over the 6 instances with 
different maximum recombination and error rates in order to find the haplotype 
configuration with the smallest number of recombinations and errors. 

The results of this experiment (Table |3]) reveal a clear trend: genotypes with 
higher intra-marker distance (i.e., lower densities) require considerably more re- 
combinations and/or errors than high-density genotypes. In fact, it was possible 
to find a haplotype configuration with a single genotyping error (and no recom- 
bination) that solves the instance with the highest density {d — 1), while 45 
errors (or 50 recombinations) were needed in the solution of the medium-density 
instance (d = 15), and 89 errors (or 6 recombinations and 81 errors) were needed 
in the lowest-density instance {d = 25). Lower-density instances (d > 15) have 
not been solved within a time limit of 3 hours for some particular choices of 
maximum recombination and error rates. 

Our overall conclusion is that the results of the three experimental parts 
support the soundness of the model and the feasibility of the approach. In fact, 
reHCstar has computed solutions with good accuracy on simulated instances 
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even in limit cases (pedigrees with many recombinations and errors and/or many 
untyped loci) and has found haplotype configurations with a few recombinations 
and errors on some real instances. The comparison with MePhase, revealed that 
ReHCstar has a comparable accuracy, but it is much faster and better scales 
to large and complex pedigrees. 

Acknowledgments. We would like to thank Wei-Bung Wang for valuable dis- 
cussions and sharing the MePhase code with us, and Dr. Enrico Santus, director 
of "Associazione Nazionale Allevatori Razza Bruna" , Bussolcngo, Italy, for pro- 
viding the pedigree and the genotypes used in this work. 
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Supplementary Material 

Supplementary Table S.l. Summary of accuracy and performance obtained 
by reHCstar on randomly generated instances. Each table refers to a series of 
tests where only one parameter has been varied. The base values of the param- 
eters are: pedigree size n — 50, genotype length m ~ 50, recombination proba- 
bility = 0.005, error probability e = 0.005, and missing probability /i — 0.05. 
The best result for each row is boldfaced. 

(a) Increasing pedigree size (n) 

Pedigree size n = 50 100 200 

No. of completed instances 

Avg. genotype imputation error rate 

Avg. iraplotyping error rate 

Avg. iraplotyping error rate (wo/missing) 

Avg. running time (in seconds) 



0/100 


100/100 


100/100 


0.161 


0.151 


0.149 


0.043 


0.039 


0.033 


0.037 


0.034 


0.027 


15.1 


82.3 


359.4 


72.3 


479.9 


1129.3 



Max. running time (in seconds) 72.3 


479.9 


1129.3 


(b) Increasing genotype length (m) 


Genotype lengtlr m — 50 


100 


200 



No. of completed instances 

Avg. genotype imputation error rate 

Avg. haplotyping error rate 

Avg. haplotyping error rate (wo/missing 

Avg. running time (in seconds) 

Max. running time (in seconds) 



)0/100 


100/100 


100/100 


0.170 


0.174 


0.174 


0.043 


0.049 


0.063 


0.038 


0.045 


0.059 


17.0 


93.6 


505.0 


85.7 


770.4 


3127.5 



(c) Increasing recombination probability (0) 



Recombination probability ^ 0.0 0.005 0.01 0.02 



No. of completed instances 

Avg. genotype imputation error rate 

Avg. haplotyping error rate 

Avg. haplotyping error rate (wo/missing) 

Avg. running time (in seconds) 

Max. running time (in seconds) 



)0/100 


100/100 


100/100 


100/100 


0.174 


0.173 


0.174 


0.185 


0.047 


0.050 


0.057 


0.077 


0.040 


0.047 


0.052 


0.072 


5.0 


7.0 


13.0 


10.5 


14.7 


50.1 


75.7 


62.6 



(d) Increasing error probability (s) 



Error probability e — 0.0 0.005 



No. of completed instances 

Avg. genotype imputation error rate 

Avg. haplotyping error rate 

Avg. haplotyping error rate (wo/missing) 

Avg. running time (in seconds) 



100/100 


100/100 


100/100 


0.172 


0.166 


0.166 


0.048 


0.047 


0.048 


0.042 


0.042 


0.044 


12.8 


28.9 


74.2 


55.5 


109.8 


485.2 



Max. running time (in seconds) 55.5 


109.8 


485.2 


(e) Increasing missing probability (fi) 


Missing probability ^ — 0.0 0.05 


0.1 


0.2 



No. of completed instances 

Avg. genotype imputation error rate 

Avg. haplotyping error rate 

Avg. haplotyping error rate (wo/missing) 

Avg. running time (in seconds) 

Max. running time (in seconds) 



)0/100 


100/100 


100/100 


100/100 


— 


0.167 


0.183 


0.207 


0.032 


0.041 


0.052 


0.071 


0.032 


0.034 


0.041 


0.052 


16.6 


15.3 


17.9 


14.0 


79.7 


76.0 


89.0 


69.2 
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